Counting common substrings effectively

Authors

Stanislaw Goldstein

Piotr Beling

Abstract

This article presents an efficient (dynamic) algorithm for counting the substrings of a given string that are also substrings of a second string. The algorithm can be used, for example, to quickly compute a string similarity measure based on the generalized n-gram method (the Niewiadomski measure [2]), as is shown. Correctness and complexity analyses are included.

1 Notation

If w and s denote words, then:
• |w| is the length of the word w (its number of letters),
• the letters of w are indexed from 0 to |w| − 1,
• w[a..b] (for a, b ∈ N) is the subword of w from the a-th to the b-th letter (inclusive), or the empty word when a > b,
• w[a..b) (for a, b ∈ N) is the subword w[a..b − 1] of w,
• w[a..) (for a ∈ N) is the subword w[a..|w|) (the suffix of w starting at the a-th letter),
• w[a] (for a ∈ N) is the a-th letter of w (w[a] = w[a..a]),
• ws is the concatenation of the words w and s, i.e. the word of length |w| + |s| such that (ws)[0..|w|) = w and (ws)[|w|..) = s,
• it holds that w = w[0..|w| − 1] = w[0..|w|) = w[0..) = w[0]w[1] … w[|w| − 1].
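The counting problem stated in the abstract can be illustrated with a small dynamic-programming sketch. It is not taken from the paper, whose full text is not reproduced here. It assumes that every occurrence w[a..b] with a ≤ b is counted separately (so repeated substrings of w count more than once), and the function name `count_common_substrings` is hypothetical. For each start position a, the sketch computes the longest prefix of the suffix w[a..) that occurs in s. Every shorter non-empty prefix then occurs in s as well, so these lengths sum to the answer.

```python
def count_common_substrings(w: str, s: str) -> int:
    """Count pairs (a, b), a <= b, such that w[a..b] occurs in s.

    Assumption: occurrences in w are counted, not distinct substrings.
    Runs in O(|w| * |s|) time and O(|s|) extra space.
    """
    # prev[j] = length of the longest common prefix of w[i+1..) and s[j..),
    # where i is the row currently being processed.
    prev = [0] * (len(s) + 1)
    total = 0
    for i in range(len(w) - 1, -1, -1):
        cur = [0] * (len(s) + 1)
        best = 0  # longest prefix of w[i..) that occurs anywhere in s
        for j in range(len(s) - 1, -1, -1):
            if w[i] == s[j]:
                # Extend the match that starts one letter later in both words.
                cur[j] = prev[j + 1] + 1
                if cur[j] > best:
                    best = cur[j]
        # w[i..i], w[i..i+1], ..., w[i..i+best-1] all occur in s.
        total += best
        prev = cur
    return total
```

In the notation above, Python's half-open slice `w[a:b]` corresponds to w[a..b) and `w[a:]` to w[a..). For example, with w = "ab" and s = "abc" the subwords "a", "b" and "ab" all occur in s, so the count is 3.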


Related resources

A Template Discovery Algorithm by Substring Amplification

In this paper, we consider the problem of finding a set of substrings common to given strings. We define this as the template discovery problem: given a set of strings generated by some fixed but unknown pattern, find the constant parts of the pattern. A pattern is a string over constant and variable symbols. It generates strings by replacing variables with constant strings. We assume that...


Linear-Space Substring Range Counting over Polylogarithmic Alphabets

Bille and Gørtz (2011) recently introduced the problem of substring range counting, for which we are asked to store compactly a string S of n characters with integer labels in [0, u], such that later, given an interval [a, b] and a pattern P of length m, we can quickly count the occurrences of P whose first characters’ labels are in [a, b]. They showed how to store S in O(n log n / log log n) spac...


Forbidden substrings on weighted alphabets

In an influential 1981 paper, Guibas and Odlyzko constructed a generating function for the number of length n strings over a finite alphabet that avoid all members of a given set of forbidden substrings. Here we extend this result to the case in which the strings are weighted. This investigation was inspired by the problem of counting compositions of an integer n that avoid all compositions of ...


Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated based on the average length of exact common substrings. In this paper, we study the length distribution of k-mismatch common substrings betwee...


GaKCo: A Fast Gapped k-mer String Kernel Using Counting

String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched subs...



Journal title:

CoRR

Volume abs/1209.4771, no issue number

Pages -

Publication date 2012
